ORDER BY and Unicode - Mailing list pgsql-novice

From M. Bastin
Subject ORDER BY and Unicode
Date
Msg-id a06100515bcc7c7cdeb35@[192.168.0.101]
Whole thread Raw
In response to pg_dump problem  (Noel Faux <noel.faux@med.monash.edu.au>)
Responses Re: ORDER BY and Unicode
List pgsql-novice
There seems to be a big problem with Unicode for
which a solution might already exist.  Somebody
had the following problem on another mailing
list.  My suggestion is at the bottom of this
message but if another solution already exists
I'd like to hear about it.

The problem is that special characters aren't
treated right under Unicode.  Here are a few
examples:

1.   "UPPER('é')" doesn't work.
(That's an accented "e" in there if it doesn't
come through in your e-mail application)

The implication of this is that

     SELECT ... WHERE UPPER(mycolumn) LIKE UPPER('my search string')

doesn't give the functionality you'd want.
UPPER and LOWER seem to work on ASCII only.  The
Greek, French, etc. are out of luck.


2.  "ORDER BY mycolumn" gives a wrong sort order.

Uppercase ASCII characters come first, then
lowercase ASCII, then accented characters...
This really isn't what a human would like to see.


I think the two examples above illustrate this
Unicode problem quite well.  Is there an existing
solution?  If not could we work together on
creating one, as suggested at the very bottom of
this message?

Thanks,

Marc

------------------------------------------------------

You can use the translate function to solve your problem.
<http://www.postgresql.org/docs/7.4/interactive/functions-string.html>

e.g. for the letter "a":

    SELECT * FROM mytable ORDER BY
translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA');

Then you build an index like this to speed things up:

   CREATE INDEX MyIndex ON MyTable
(translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA'));

And your select queries will also be case and
accent independent from then on e.g:

    SELECT * FROM mytable WHERE
translate(textcolumn, 'àáâäÀÁÂÄa', 'AAAAAAAAA')
LIKE translate('my search string', 'àáâäÀÁÂÄa',
'AAAAAAAAA');


Since the alphabet is very long you're better off
creating your own function that does this
translation for the whole alphabet and then you'd
get something like this:

    SELECT * FROM mytable WHERE
MySimpleABC(textcolumn) LIKE MySimpleABC('my
search string') ORDER BY MySimpleABC(textcolumn);

Your index would be like this:

   CREATE INDEX MyIndex ON MyTable (MySimpleABC(textcolumn));

Cheers,

Marc

PS:  Maybe we should work together to create the
mother of all functions that would do this for
all of Unicode?  Anybody else up to this?

pgsql-novice by date:

Previous
From: Sandro Martinez
Date:
Subject: ayuda
Next
From: Stephan Szabo
Date:
Subject: Re: ORDER BY and Unicode